SYDE 676 Project Report – Fall 2002 Web Document Clustering Using Phrase-based Document Similarity

Author

Khaled M. Hammouda

Abstract

Measuring the similarity between documents is an essential operation in text mining, especially in document clustering. Traditionally, document similarity has been computed by extracting individual words from the documents and using heuristics to weight those features; standard data mining methods then compute the similarity between documents from the weighted features. In this project, an information-theoretic similarity measure is derived from the phrases shared between documents rather than from individual words. Two pairwise document similarity measures are proposed: one corpus-dependent and one corpus-independent. The corpus-independent measure allows documents to be processed incrementally, and it is the only one evaluated in this report. The similarity measure is used to cluster web documents, where it proved to be more accurate than traditional similarity measures. Clustering quality is evaluated using the F-measure and Entropy.

Related resources

A Novel Weighted Phrase-Based Similarity for Web Documents Clustering

Phrase has been considered as a more informative feature term for improving the effectiveness of document clustering. In this paper, a weighted phrase-based document similarity is proposed to compute the pairwise similarities of documents based on the Weighted Suffix Tree Document (WSTD) model. The weighted phrase-based document similarity is applied to the Group-average Hierarchical Agglomerat...

Web Document Clustering based on Document Structure

Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To achieve more accurate document clustering, document structure should be reflected in the underlying data model. This paper presents a framework for web document clustering based on two important concepts. The first one is the web document structure, which is currently ...

Phrase-based Document Similarity Based on an Index Graph Model

Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To better capture the structure of documents, the underlying data model should be able to represent the phrases in the document as well as single terms. We present a novel data model, the Document Index Graph, which indexes web documents based on phrases, rather than sing...

Phrase based Clustering Scheme of Suffix Tree Document Clustering Model

Document clustering is one of the difficult and recent research fields in the search engine research. Most of the existing documents clustering techniques use a group of keywords from each document to cluster the documents. Document clustering arises from information retrieval domains, and “It finds grouping for a set of documents belonging to the same cluster are similar and documents belongs ...

Transition Potential Modeling of Land-Cover based on Similarity Weighted Instance-based Learning Procedure and Its Implication in the REDD Project Design Document

Reducing Emissions from Deforestation and Forest Degradation (REDD) is a climate change mitigation strategy employed to reduce the intensity of deforestation and GHG emissions. In recent decades, drastic land use changes in Mazandaran province have caused a substantial reduction in the extent of the Hyrcanian forests. The present research based on objectives of REDD projects paid to identify of fore...

Publication date: 2003
